
Add persistent program cache for Program.compile #1912

Open
cpcloud wants to merge 28 commits into NVIDIA:main from
cpcloud:persistent-program-cache-178

Conversation

@cpcloud
Contributor

@cpcloud cpcloud commented Apr 14, 2026

Summary

Adds a persistent on-disk cache for cuda.core.Program.compile outputs. The high-level integration is a single keyword argument on Program.compile:

from cuda.core import Program, ProgramOptions
from cuda.core.utils import FileStreamProgramCache

source = 'extern "C" __global__ void k(int *a){ *a = 1; }'
options = ProgramOptions(arch="sm_80")

with FileStreamProgramCache() as cache:  # default: $XDG_CACHE_HOME/cuda-python/program-cache
    obj = Program(source, "c++", options=options).compile("cubin", cache=cache)
    obj.get_kernel("k")

A second invocation with the same inputs short-circuits the entire NVRTC compile: cache.get(key) (one stat + one read) plus an ObjectCode._init from the bytes. No NVRTC compilation is invoked. This is the fast path the cache exists to provide:

# Fresh process / second run -- same source, same options.
with FileStreamProgramCache() as cache:
    obj = Program(source, "c++", options=options).compile("cubin", cache=cache)
    # ~10us round-trip on a warm page cache, vs hundreds of ms to seconds
    # for an actual NVRTC invocation.

Public API

  • Program.compile(target_type, *, cache=...) — convenience wrapper. Derives the key, returns a fresh ObjectCode on hit, stores the compile output on miss.
  • cuda.core.utils.ProgramCacheResource — abstract bytes-in / bytes-out interface for custom backends. Provides get, update (Mapping or pairs), clear, and the mapping mutators (__getitem__/__setitem__/__delitem__/__len__). __contains__ is intentionally omitted: cache.get(key) is the recommended idiom because the two-call if key in cache: cache[key] pattern is racy across processes.
  • cuda.core.utils.InMemoryProgramCache — single-process LRU on OrderedDict, threading.RLock, size-only cap. For "compile once, look up many" workflows that don't need persistence.
  • cuda.core.utils.FileStreamProgramCache — directory of atomic per-entry files. Safe across processes via os.replace + Windows sharing-violation retries on os.replace / read / unlink.
  • cuda.core.utils.make_program_cache_key — escape hatch when the compile inputs require an extra_digest (include_path, pre_include, pch, use_pch, pch_dir, NVVM use_libdevice=True, NVRTC options.name with a directory component). Program.compile(cache=...) rejects those compiles with a ValueError pointing here.
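The bytes-in / bytes-out contract described above can be illustrated with a standalone toy backend. This is a hypothetical sketch against the standard library only (it does not import cuda.core); the method names mirror the ProgramCacheResource description, and collections.abc.MutableMapping supplies get/update/clear for free:

```python
# Hypothetical sketch of the bytes-in / bytes-out cache surface -- not the
# cuda.core implementation. Keys are digest strings, values are raw bytes.
from collections.abc import MutableMapping


class DictBackedProgramCache(MutableMapping):
    """Toy in-memory backend implementing the mapping mutators."""

    def __init__(self):
        self._entries = {}

    def __getitem__(self, key):
        return self._entries[key]

    def __setitem__(self, key, value):
        # Accept bytes-like input forms; store an immutable copy.
        self._entries[key] = bytes(value)

    def __delitem__(self, key):
        del self._entries[key]

    def __len__(self):
        return len(self._entries)

    def __iter__(self):
        return iter(self._entries)


cache = DictBackedProgramCache()
cache["deadbeef"] = memoryview(b"\x7fELF...")
# Single-call lookup: the recommended idiom instead of the racy
# `if key in cache: cache[key]` two-step.
blob = cache.get("deadbeef")
assert blob == b"\x7fELF..."
assert cache.get("missing") is None
```

Note that MutableMapping also derives a __contains__, which the real ABC intentionally omits; a faithful subclass of ProgramCacheResource would implement only the methods the ABC declares.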

On-disk format

Each entry is the raw compiled binary verbatim — cubin / PTX / LTO-IR — with no pickle, JSON, length prefix, or framing of any kind. Cache files are directly consumable by external NVIDIA tools (cuobjdump, nvdisasm, cuda-gdb).

ObjectCode.symbol_mapping from name_expressions is not preserved across a cache round-trip; the wrapper rejects Program.compile(name_expressions=..., cache=...) outright so the first-call-works/second-call-breaks footgun can't surface. Callers that need get_kernel(name_expression) should compile without cache=.

FileStreamProgramCache

  • Atomic writes: stage to tmp/, fsync, os.replace into entries/<2char>/<hash>. Concurrent readers never observe partial writes. Windows os.replace retries on ERROR_ACCESS_DENIED / ERROR_SHARING_VIOLATION / ERROR_LOCK_VIOLATION (winerrors 5/32/33) within a bounded backoff (~185 ms); after the budget, the write is dropped and the next call recompiles. The same retry covers reads and path.unlink so eviction doesn't crash the writer that triggered it on win-64.
  • Sharing-violation predicate: _is_windows_sharing_violation(exc) filters EACCES only when winerror is absent — non-sharing winerrors are real config errors and propagate. Off-Windows PermissionError always propagates.
  • Transparent input forms: cache[key] = value (and cache.update({key: value, ...})) accept raw bytes, bytearray, memoryview, or any ObjectCode (path-backed too — the file is read at write time so the cached entry is the binary content, not a path that could move). Reads return the same bytes that went in.
  • Size-only bound: max_size_bytes is the only knob — no element-count cap. None means unbounded.
  • True LRU via atime: every successful read calls os.utime (fd-based on Linux/macOS via os.supports_fd, path-based on Windows) to bump st_atime regardless of mount options or NtfsDisableLastAccessUpdate. Eviction sorts by oldest st_atime first. The atime touch is stat-guarded so a racing rewriter's freshly-replaced file never has its mtime rolled back.
  • Stat-guarded prunes: clear(), _enforce_size_cap(), and the atime touch all snapshot (ino, size, mtime_ns) per entry and refuse to unlink / overwrite stamps if a writer replaced the file mid-operation.
  • Cache key derivation (make_program_cache_key): a backend-strategy pattern with one class per code_type (_NvrtcBackend / _LinkerBackend / _NvvmBackend). Each owns its own validate / encode_code / option_fingerprint / encode_name_expressions / hash_version_probe / hash_extra_payload. The orchestrator validates code_type/target_type, dispatches to the right backend, and assembles the digest in fixed order. Adding a new backend is one new class, not a five-place edit.
  • NVRTC options.name with a directory component: rejected without extra_digest because NVRTC resolves quoted #include directives relative to that directory — neighbour-header changes wouldn't invalidate the cache otherwise.
  • PTX-loadability warning on cache hit: when the active driver can't load freshly-generated PTX, the wrapper emits the same RuntimeWarning the uncached path emits — loadability depends on the driver, not on whether the bytes were freshly compiled.
  • Default cache directory: when path is omitted, resolves via platformdirs.user_cache_path("cuda-python", appauthor=False, opinion=False) / "program-cache":
    • Linux/BSD: $XDG_CACHE_HOME/cuda-python/program-cache (default ~/.cache/cuda-python/program-cache)
    • macOS: ~/Library/Caches/cuda-python/program-cache
    • Windows: %LOCALAPPDATA%\cuda-python\program-cache
  • tmp/ self-heal: if something deletes tmp/ after the cache is opened, the next write recreates it rather than crashing with FileNotFoundError.
  • Crashed-writer cleanup: stale temp files older than 1 hour are swept on open and on size-cap enforcement.
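The atomic-write discipline from the first bullet can be sketched with the standard library alone. This is a hypothetical simplification (the real implementation stages into a dedicated tmp/ directory and wraps the replace in the Windows sharing-violation retries described above):

```python
import contextlib
import os
import tempfile
from pathlib import Path


def atomic_write(entry_path: Path, payload: bytes) -> None:
    """Stage to a temp file, fsync, then os.replace into place.

    Concurrent readers see either the old complete file or the new
    complete file, never a torn write (os.replace is atomic on POSIX
    and NTFS). The temp file is staged on the same filesystem so the
    final rename cannot degrade into a copy.
    """
    entry_path.parent.mkdir(parents=True, exist_ok=True)
    fd, tmp_name = tempfile.mkstemp(dir=entry_path.parent)
    try:
        with os.fdopen(fd, "wb") as f:
            f.write(payload)
            f.flush()
            os.fsync(f.fileno())  # durable before it becomes visible
        os.replace(tmp_name, entry_path)  # atomic rename into place
    except BaseException:
        with contextlib.suppress(FileNotFoundError):
            os.unlink(tmp_name)  # don't leak the staged temp file
        raise


# Sharded layout as described: entries/<2char>/<rest-of-digest>
target = Path(tempfile.mkdtemp()) / "entries" / "ab" / "cdef0123"
atomic_write(target, b"fake-cubin-bytes")
assert target.read_bytes() == b"fake-cubin-bytes"
```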

Test plan

  • tests/test_program_cache.py — abstract-class contract, update accepts mapping or pairs, transparent input-form equivalence (bytes / bytearray / memoryview / bytes-backed ObjectCode / path-backed ObjectCode all round-trip to the same on-disk bytes), make_program_cache_key semantics (deterministic, supported-target matrix mirrors Program.compile, backend probe failures fail closed but stable, env-version changes don't perturb the key on the wrong backends, options-fingerprint canonicalization for the linker path, side-effect / external-content / NVRTC options.name-dir-component guards, schema version mixing), filestream CRUD, atomic-write race coverage, stat-guarded prune / atime-touch / clear / size-cap, atime LRU promotes recently-read, default-dir uses platformdirs, _is_windows_sharing_violation predicate's truth table including the regression case (non-sharing winerror plus EACCES propagates), tmp/ recreation after external wipe.
  • tests/test_program_cache_multiprocess.py — concurrent writers same key, distinct keys, reader-vs-writer torn-file safety, size-cap eviction race (rewriter vs. churner) under stat-guarded eviction.
  • tests/test_program_compile_cache.py — Program.compile(cache=...) miss/hit/error paths against a recording stub, name_expressions rejection, extra_digest-required / side-effect / NVRTC options.name-dir-component rejection, PTX loadability warning on cache hit (positive + negative), real-NVRTC end-to-end roundtrip across reopen.

@cpcloud cpcloud added this to the cuda.core v1.0.0 milestone Apr 14, 2026
@cpcloud cpcloud added P0 High priority - Must do! feature New feature or request cuda.core Everything related to the cuda.core module labels Apr 14, 2026
@cpcloud cpcloud self-assigned this Apr 14, 2026
@cpcloud cpcloud force-pushed the persistent-program-cache-178 branch from de57bd8 to ac38a68 Compare April 14, 2026 22:15

@cpcloud cpcloud force-pushed the persistent-program-cache-178 branch 23 times, most recently from f1ae40e to b27ed2c Compare April 19, 2026 13:28
Comment on lines +25 to +26
Intentionally does NOT subclass ``ProgramCacheResource`` -- the wrapper
should be duck-typed, so we test the duck-typed surface directly.
Member


Why don't we require each cache= instance to subclass from ProgramCacheResource?

Contributor Author


Done in 701d00b. Program.compile now isinstance-checks cache= against ProgramCacheResource and raises TypeError up front. The recording test cache subclasses the ABC; added a regression test that a duck-typed cache is rejected.

Comment thread cuda_core/cuda/core/_program.pyx Outdated

- cpdef bint _can_load_generated_ptx() except? -1:
-     """Check if the driver can load PTX generated by the current NVRTC version."""
+ def _can_load_generated_ptx():
Member


cpdef functions are also accessible from Python, is this change needed?

Contributor Author


Done in 821131a — restored cpdef bint _can_load_generated_ptx() except? -1. The two cache-hit warning tests now mock driver_version (to (0, 0, 0) for the warning case, (999, 0, 0) for the no-warning case) so the cpdef body computes the desired result without needing a Python-level test seam.

Member

@leofang leofang left a comment


Review of the persistent program cache implementation. Findings are categorized inline as:

  • Critical (1): Must fix before merging — cache-write failure drops a successfully compiled ObjectCode.
  • Consideration (8): Performance/functionality concerns worth discussing; can be deferred but should be tracked.
  • Nitpick (6): Not blockers for merging.

This excludes items already captured in Leo's and rwgk's earlier review comments (platformdirs removal, over-eviction race, source-directory include guard, ObjectCode._init in docstrings, cpdef → def change, duck-typed test, close() vs context manager, multi-GPU usage, star-import laziness, doc section placement, SQLiteProgramCache removal).

Overall this is a well-engineered piece of work — the TOCTOU handling, stat-guards, and atomic-write design are thorough and well-documented. The main concern is the cache-write failure path losing the compile result.

Member

@leofang leofang left a comment


Follow-up batch — 6 additional inline comments from the first review round:

  • Consideration (3): temp file burst-write thrashing, O(n) _enforce_size_cap on every write, UTF-8 decode introducing a new failure mode in the cache path.
  • Nitpick (3): stat-key triple dedup, Windows sharing-retry dedup, _KeyBackend class hierarchy vs simpler function dispatch.

Comment thread cuda_core/cuda/core/utils/_program_cache/_file_stream.py
with contextlib.suppress(FileNotFoundError):
tmp_path.unlink()
raise
self._enforce_size_cap()
Member


Consideration: _enforce_size_cap is O(n) on every __setitem__.

Every write stats all files in entries/ plus tmp/ to compute the total. For a large cache (thousands of entries), this could be measurably costly on every compile. An incremental size tracker (add on write, subtract on eviction, periodic reconciliation to correct drift from external deletions) would make writes O(1) in the common case, falling back to a full scan only when reconciliation is needed.

Contributor Author


Agreed — this is an optimization, not a correctness issue. The cache is bounded by max_size_bytes, so the typical case walks a few hundred cubins per write at microsecond stat cost. An incremental size tracker with periodic reconciliation is the right answer for very large caches; deferring to a follow-up.
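A rough sketch of the deferred idea, with hypothetical names (add on write, subtract on eviction, periodic full rescan to correct drift from external deletions):

```python
import tempfile
from pathlib import Path


class IncrementalSizeTracker:
    """O(1) size accounting between periodic full-scan reconciliations.

    Hypothetical sketch of the follow-up idea, not cuda.core code.
    """

    def __init__(self, entries_dir: Path, reconcile_every: int = 1024):
        self.entries_dir = entries_dir
        self._ops = 0
        self._reconcile_every = reconcile_every
        self.total = self._full_scan()  # one full scan at open

    def _full_scan(self) -> int:
        return sum(
            p.stat().st_size for p in self.entries_dir.rglob("*") if p.is_file()
        )

    def _maybe_reconcile(self) -> None:
        # External deletions drift the counter; rescan occasionally.
        self._ops += 1
        if self._ops >= self._reconcile_every:
            self._ops = 0
            self.total = self._full_scan()

    def on_write(self, nbytes: int, replaced_nbytes: int = 0) -> None:
        self.total += nbytes - replaced_nbytes
        self._maybe_reconcile()

    def on_evict(self, nbytes: int) -> None:
        self.total -= nbytes
        self._maybe_reconcile()


root = Path(tempfile.mkdtemp())
(root / "a.cubin").write_bytes(b"12345")
tracker = IncrementalSizeTracker(root)  # full scan once: 5 bytes
tracker.on_write(3)                     # a 3-byte entry just landed
assert tracker.total == 8
```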

cpcloud added 25 commits May 5, 2026 04:06
Move the Program caches autosummary out of the trailing
cuda.core.utils block and into a subsection of CUDA compilation
toolchain so the cache classes sit next to Program/Linker. The
subsection switches the current module to cuda.core.utils for the
autosummary and switches back afterwards. make_program_cache_key
moves with the cache classes.
The default close() on ProgramCacheResource is a no-op because the
in-tree backends happen not to hold long-lived state. Future
backends will -- file handles, sockets, db connections -- so the
docstring now spells that out and tells callers to use the context
manager or call close() explicitly so their code is portable across
backends.
Two docstring examples (in ProgramCacheResource and
make_program_cache_key) reached into the private cuda.core._module
path and used the private ObjectCode._init constructor. Switch to
the public from cuda.core import ObjectCode plus
ObjectCode.from_cubin(...) so users learning the cache API don't
get steered at private surface.
…source

cache= used to accept any object with the get/__setitem__ duck-typed
shape. Tighten the contract: the wrapper now isinstance-checks against
ProgramCacheResource and raises TypeError up front when callers pass
a plain dict-like. Subclasses get the get/update/close/__enter__/__exit__
defaults from the ABC and a portable interface across backends.

Update the recording test cache to subclass the ABC and add a regression
test that a duck-typed cache is rejected with TypeError.
The PTX cache-hit warning tests used to monkeypatch
_can_load_generated_ptx itself, which forced the helper to a plain def
so Cython would not early-bind the in-module call past the patch.
Restore it to cpdef bint ... except? -1 and instead pin
driver_version() in the test to (0, 0, 0) (warning expected) or
(999, 0, 0) (no warning) so the cpdef body computes the desired
result. Cython compiles each global lookup inside the helper as a
fresh module-globals fetch, so swapping driver_version on the
module object is enough to steer the comparison.
_LinkerBackend.validate, option_fingerprint, and hash_version_probe each
re-probed _decide_nvjitlink_or_driver(), so a flapping probe could mint
a key whose option fingerprint and version probe disagreed on which
linker is in use.

Cache the decision (and any probe exception) on a per-instance basis,
instantiate _BACKENDS_BY_CODE_TYPE entries fresh per make_program_cache_key
call so the cache lives exactly one call, and thread the decision into
_linker_backend_and_version() instead of letting it probe a third time.

Tests that monkeypatched _linker_backend_and_version now accept the
extra use_driver argument (or *args/**kwargs in the failure-path test).
code_type was already normalised at Program init, but target_type was
checked case-sensitively against {"ptx", "cubin", "ltoir"} in
Program_compile (so compile(target_type="PTX") used to raise) and the
cache key path inherited the same asymmetry from
make_program_cache_key. Lowercase target_type at the top of
Program.compile and at the entry to make_program_cache_key so callers
who pass "PTX" get the same dispatch and the same cache key as "ptx".
…IELD_GATES

The names tuple and the gates dict had to be kept in sync by hand: a
field added to one but forgotten in the other would silently slip out
of the PTX fingerprint. Drop the tuple and iterate the dict directly,
so the dict is the single source of truth for which ProgramOptions
fields perturb a PTX cache key.
_extract_bytes used to let a bare FileNotFoundError bubble up from
Path(code).read_bytes(), so cache[key] = obj failures pointed only at
the missing path with no hint that the cache was reading a path-backed
ObjectCode. Wrap the FileNotFoundError with a message that names both
the cache operation and the missing file so debugging the case stays
self-explanatory.
The reader test only checked that every read returned non-None; a
half-written file with non-empty bytes would pass. Carry the seeded
payload into the worker, count exact-byte mismatches, and require
zero. The eviction-race test wrote a final uncontested cache[key] =
payload + b"final" after the churner exited, so the post-race
endswith assertion would pass even with a broken stat-guard. Drop the
final write and assert the entry survives carrying a rewriter payload
prefix -- if the stat-guarded eviction path is broken, the in-race
write is the one that vanishes.
… overwrite test

InMemoryProgramCache claims thread-safety via the RLock that wraps every
method but had no concurrent-thread coverage. Add a stress test with 4
writers + 4 readers x 200 ops against a size-capped cache that verifies
no exceptions, no deadlocks (RLock reentrance through __setitem__ ->
_evict_to_caps -> popitem), and that internal accounting (_total_bytes,
len(_entries), len(cache)) stays consistent under contention.

FileStreamProgramCache had no overwrite test analogous to
test_inmemory_cache_overwrite_replaces_value_and_updates_size. Add one
that writes a key twice and asserts the second value reads back, len
stays at 1, and exactly one entry file lives on disk -- so a leaked
entry from a botched os.replace would surface here.
max_size_bytes=0 used to slip past the >=0 guard but turned the cache
into a black hole: every write was immediately evicted on its own
size-cap pass. There is no legitimate use for that, so tighten the
guard to >0 (or None for unbounded) and update both backends and the
matching tests.
The (st_ino, st_size, st_mtime_ns) triple was open-coded in
_touch_atime (fd-based and path-based fallbacks),
_prune_if_stat_unchanged, and _enforce_size_cap. Centralise the
fingerprint as _stat_key(st) so all four readers compare the same
fields and the invariant has one place to read.
Replace, stat-and-read, and unlink each carried their own copy of the
_REPLACE_RETRY_DELAYS / sleep / try-op / PermissionError loop. Centralise
the loop as _with_sharing_retry(op, on_exhausted=...) and let each
caller plug in its own success-on-success and exhausted-budget
behaviour. Net behaviour is unchanged (exhaustion semantics for each
public helper are preserved via the on_exhausted callback).
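The shape of that centralised loop is roughly the following, as a hypothetical standard-library sketch (the real helper additionally filters for the specific Windows sharing-violation winerrors described earlier):

```python
import time

_RETRY_DELAYS = (0.005, 0.01, 0.02, 0.04, 0.08)  # bounded backoff budget


def with_sharing_retry(op, on_exhausted=None):
    """Run op(); on PermissionError, back off and retry within the budget.

    Once the budget is spent, delegate to on_exhausted (e.g. drop the
    write and recompile next time, or re-raise) instead of looping forever.
    """
    last_exc = None
    for delay in (0.0,) + _RETRY_DELAYS:
        if delay:
            time.sleep(delay)
        try:
            return op()
        except PermissionError as exc:
            last_exc = exc
    if on_exhausted is not None:
        return on_exhausted(last_exc)
    raise last_exc


calls = []


def flaky_replace():
    # Fails twice as if another process held the file, then succeeds.
    calls.append(1)
    if len(calls) < 3:
        raise PermissionError("sharing violation")
    return "replaced"


assert with_sharing_retry(flaky_replace) == "replaced"
assert len(calls) == 3
```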
FileStreamProgramCache shards cache files into
entries/<digest[:2]>/<digest[2:]>, so the overwrite test's iterdir
filter on entries/ saw only the digest-prefix subdir (no is_file()
match) and reported 0 entries. Switch to rglob so the assertion
counts actual entry files. CI from the previous push caught this.
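The sharded-layout pitfall this commit fixes can be reproduced in a few lines (hypothetical paths, standard library only):

```python
import tempfile
from pathlib import Path

# Sharded layout: entries/<digest[:2]>/<digest[2:]>
entries = Path(tempfile.mkdtemp()) / "entries"
(entries / "ab").mkdir(parents=True)
(entries / "ab" / "cdef0123").write_bytes(b"\x00")

# iterdir() only sees the "ab" shard directory -- nothing passes is_file()
flat = [p for p in entries.iterdir() if p.is_file()]
# rglob("*") descends into the shards and finds the actual entry file
deep = [p for p in entries.rglob("*") if p.is_file()]
assert len(flat) == 0 and len(deep) == 1
```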


Development

Successfully merging this pull request may close these issues.

Implement compiler caches

4 participants